January 11, 2016
In May 2015, Science retracted a study of how canvassers can sway people's opinions about gay marriage, published just 5 months earlier.
Science Editor-in-Chief Marcia McNutt: Original survey data not made available for independent reproduction of results. + Survey incentives misrepresented. + Sponsorship statement false.
Two Berkeley grad students who attempted to replicate the study quickly discovered that the data must have been faked.
Methods we'll discuss today can't prevent this, but they can make it easier to discover issues.
From the authors of Low Dose Lidocaine for Refractory Seizures in Preterm Neonates:
"The article has been retracted at the request of the authors. After carefully re-examining the data presented in the article, they identified that data of two different hospitals got terribly mixed. The published results cannot be reproduced in accordance with scientific and clinical correctness."
The authors informed the journal that the merge of lab results and other survey data used in the paper resulted in an error regarding the identification codes. Results of the analyses were based on the data set in which this error occurred. Further analyses established the results reported in this manuscript and interpretation of the data are not correct.
Original conclusion: Lower levels of CSF IL-6 were associated with current depression and with future depression […].
Revised conclusion: Higher levels of CSF IL-6 and IL-8 were associated with current depression […].
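A merge error of this kind is easier to catch when the merge is scripted and carries its own checks. The following is only a minimal sketch in base R, not the procedure used in any of the retracted studies: the data frames `lab` and `survey`, their columns, and their values are invented for illustration.

```r
# Hypothetical data: lab results and survey answers keyed by participant id.
lab    <- data.frame(id = c(1, 2, 3), il6 = c(0.8, 1.4, 2.1))
survey <- data.frame(id = c(1, 2, 4), depressed = c(FALSE, TRUE, TRUE))

# Check 1: ids must be unique within each table before merging.
stopifnot(!anyDuplicated(lab$id), !anyDuplicated(survey$id))

# Check 2: record ids found in only one table instead of losing them silently.
only_lab    <- setdiff(lab$id, survey$id)
only_survey <- setdiff(survey$id, lab$id)

# Inner join on the identification code.
merged <- merge(lab, survey, by = "id")

# Check 3: the merged row count must equal the number of shared ids.
stopifnot(nrow(merged) == length(intersect(lab$id, survey$id)))
print(merged)
```

Because the checks live in the script, they are rerun every time the data are merged again.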
Original conclusion: The risk of divorce in a heterosexual marriage increases when the wife falls ill, but not the husband.
Corrected conclusion: Based on the corrected analysis, we conclude that there are not gender differences in the relationship between gender, pooled illness onset, and divorce.
"The research environment is fast-paced given the ethos to “publish or perish.”"
"[…] research is becoming increasingly complex, with greater calls for transdisciplinary collaborations, “big data,” and more sophisticated research questions and methods […] data sets often have multiple files that require merging, change the wording of questions over time, provide incomplete codebooks, and have unclear and sometimes duplicative variables."
"Given these issues, I would not be surprised if coding errors were fairly common, and that the ones discovered constitute only the “tip of the iceberg.”"
Source: Karl Broman
Your closest collaborator is you six months ago,
but you don’t reply to emails.
- Mark Holder
We need an environment where
data, analysis, and results are tightly connected, or better yet, inseparable
documentation is human readable and syntax is minimal
Scriptability \(\rightarrow\) R
Literate programming \(\rightarrow\) R Markdown
Version control \(\rightarrow\) Git / GitHub
Other considerations
Learning curve: Point-and-click software (supposedly) has a shallower learning curve than scripting languages
Automation: Need to rerun your analysis with new/updated data? Just change the input file.
Collaboration: Sharing your analysis is as easy as sharing your scripts
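The automation point can be sketched as a small script in which the input file is the only thing that changes between runs. This is a minimal sketch rather than the workshop's own code: the helper name `summarise_ages` and the demo data written to a temporary file are assumptions made for illustration.

```r
# Hypothetical helper: the whole analysis is a function of the input path,
# so rerunning with new or updated data means changing only `input_file`.
summarise_ages <- function(input_file) {
  dat <- read.delim(input_file)  # tab-delimited text with a header row
  c(n = nrow(dat), mean_age = mean(dat$age, na.rm = TRUE))
}

# Demo data written to a temporary file so the sketch runs on its own.
input_file <- tempfile(fileext = ".txt")
write.table(data.frame(age = c(20, 30, 40)), input_file,
            sep = "\t", row.names = FALSE, quote = FALSE)

result <- summarise_ages(input_file)
print(result)
```

Sharing this script, together with the input file, is all a collaborator needs to repeat the analysis.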

There are a number of other great programming tools out there that can also be used to improve the reproducibility of your analysis
The key is to use some type of language that will allow you to automate and document your analysis
Once you master one language you'll probably find it easier to learn another
You could just type into the command prompt…
… but that doesn't help much with documentation
… but that doesn't help much with automation
"Let us change our traditional attitude to the construction of programs: Instead of imagining that our main task is to instruct a computer what to do, let us concentrate rather on explaining to human beings what we want a computer to do."
"The practitioner of literate programming […] strives for a program that is comprehensible because its concepts have been introduced in an order that is best for human understanding, using a mixture of formal and informal methods that reinforce each other."
With RStudio you can combine your programming and your documentation
Markdown is a lightweight markup language for creating HTML (or XHTML) documents.
Markup languages are designed to produce documents from human readable text (and annotations).
Some of you may be familiar with LaTeX. This is another (less human-friendly) markup language for creating PDF documents.
Well, it's R + Markdown
Ease of Markdown syntax
Rendering of R code to produce output and plots
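To make this concrete, the sketch below writes a tiny R Markdown file from R: a header, a line of Markdown text, and one R code chunk. The file name and the chunk content are made up for illustration. Rendering requires the rmarkdown package, so that call is left commented out.

```r
# Build the chunk delimiter (three backticks) programmatically.
fence <- strrep("`", 3)

rmd_lines <- c(
  "---",
  "title: \"A minimal example\"",
  "output: html_document",
  "---",
  "",
  "Some *Markdown* text describing the analysis.",
  "",
  paste0(fence, "{r}"),
  "summary(cars$speed)  # R code; its output appears in the rendered document",
  fence
)

rmd_file <- file.path(tempdir(), "example.Rmd")  # hypothetical file name
writeLines(rmd_lines, rmd_file)

# Rendering runs the chunk and weaves text, code, and output together:
# rmarkdown::render(rmd_file)
```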
library(dplyr) # provides %>% and tbl_df()
big5 <- read.delim("raw-data/big5.txt") %>%
  tbl_df() # for formatting; newer dplyr versions use as_tibble() instead
big5
## Source: local data frame [19,719 x 57]
##
##     race   age engnat gender  hand source country    E1    E2    E3    E4
##    (int) (int)  (int)  (int) (int)  (int)  (fctr) (int) (int) (int) (int)
## 1      3    53      1      1     1      1      US     4     2     5     2
## 2     13    46      1      2     1      1      US     2     2     3     3
## 3      1    14      2      2     1      1      PK     5     1     1     4
## 4      3    19      2      2     1      1      RO     2     5     2     4
## 5     11    25      2      2     1      2      US     3     1     3     3
## 6     13    31      1      2     1      2      US     1     5     2     4
## 7      5    20      1      2     1      5      US     5     1     5     1
## 8      4    23      2      1     1      2      IN     4     3     5     3
## 9      5    39      1      2     3      4      US     3     1     5     1
## 10     3    18      1      2     1      5      US     1     4     2     5
## ..   ...   ...    ...    ...   ...    ...     ...   ...   ...   ...   ...
## Variables not shown: E5 (int), E6 (int), E7 (int), E8 (int), E9 (int), E10
##   (int), N1 (int), N2 (int), N3 (int), N4 (int), N5 (int), N6 (int), N7
##   (int), N8 (int), N9 (int), N10 (int), A1 (int), A2 (int), A3 (int), A4
##   (int), A5 (int), A6 (int), A7 (int), A8 (int), A9 (int), A10 (int), C1
##   (int), C2 (int), C3 (int), C4 (int), C5 (int), C6 (int), C7 (int), C8
##   (int), C9 (int), C10 (int), O1 (int), O2 (int), O3 (int), O4 (int), O5
##   (int), O6 (int), O7 (int), O8 (int), O9 (int), O10 (int)
You can include script files in your R Markdown document:
source("code/01-data-cleanup.R")
ggplot(big5, aes(x = age)) + geom_histogram()
summary(big5$age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   13.00   18.00   22.00   26.26   31.00   99.00
Extraversion: Seeking fulfillment from sources outside the self or in community. High scorers are social; low scorers prefer to work alone.
Neuroticism: Being emotional.
m_ext_age <- lm(extraversion ~ neuroticism * gender, data = big5)
summary(m_ext_age)
##
## Call:
## lm(formula = extraversion ~ neuroticism * gender, data = big5)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -25.3125  -6.3391   0.0132   6.6079  26.0924
##
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)
## (Intercept)             15.202758   0.190240  79.913  < 2e-16
## neuroticism              0.297346   0.009615  30.925  < 2e-16
## genderMale              -1.893017   0.327308  -5.784 7.42e-09
## genderOther             -5.721794   2.177580  -2.628  0.00861
## neuroticism:genderMale   0.001576   0.015226   0.104  0.91755
## neuroticism:genderOther -0.008332   0.125205  -0.067  0.94694
##
## Residual standard error: 8.854 on 19605 degrees of freedom
##   (24 observations deleted due to missingness)
## Multiple R-squared: 0.08003, Adjusted R-squared: 0.0798
## F-statistic: 341.1 on 5 and 19605 DF, p-value: < 2.2e-16
ggplot(data = big5, aes(x = neuroticism, y = extraversion, color = gender)) +
  geom_point(alpha = 0.5) +
  geom_jitter() +
  geom_smooth(method = "lm")
big5_teen <- filter(big5, age <= 19)
m_ext_age_teen <- lm(extraversion ~ age * gender, data = big5_teen)
summary(m_ext_age_teen)
##
## Call:
## lm(formula = extraversion ~ age * gender, data = big5_teen)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -19.8426  -6.9399   0.0037   7.0601  22.6662
##
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)
## (Intercept)     14.12536    1.43788   9.824  < 2e-16
## age              0.30091    0.08502   3.539 0.000404
## genderMale       6.78702    2.47559   2.742 0.006131
## genderOther      6.66006   11.01228   0.605 0.545342
## age:genderMale  -0.42066    0.14590  -2.883 0.003949
## age:genderOther -0.76174    0.66364  -1.148 0.251085
##
## Residual standard error: 9.366 on 6740 degrees of freedom
##   (10 observations deleted due to missingness)
## Multiple R-squared: 0.005666, Adjusted R-squared: 0.004929
## F-statistic: 7.681 on 5 and 6740 DF, p-value: 3.274e-07
ggplot(data = big5_teen, aes(x = neuroticism, y = extraversion, color = gender)) +
  geom_point(alpha = 0.5) +
  geom_jitter() +
  geom_smooth(method = "lm")
Version control is a system that records changes to a file or set of files over time so that you can recall specific versions later.
Source: Piled Higher and Deeper by Jorge Cham, http://www.phdcomics.com.
2013-10-14_manuscriptFish.doc
2013-10-30_manuscriptFish.doc
2013-11-05_manusctiptFish_intitialRyanEdits.doc
2013-11-10_manuscriptFish.doc
2013-11-11_manuscriptFish.doc
2013-11-15_manuscriptFish.doc
2013-11-30_manuscriptFish.doc
2013-12-01_manuscriptFish.doc
2013-12-02_manuscriptFish_PNASsubmitted.doc
2014-01-03_manuscriptFish_PLOSsubmitted.doc
2014-02-15_manuscriptFish_PLOSrevision.doc
2014-03-14_manuscriptFish_PLOSpublished.doc
Every time you save, you zip the entire directory that your project files are in and store it with a date.
From Code for RNeXML R package, plus RNeXML publication in RMarkdown, https://github.com/ropensci/RNeXML.
Version control systems start with a base version of the document and then save just the changes you made at each step of the way.
You can think of it as a tape: if you rewind the tape and start at the base document, then you can play back each change and end up with your latest version.
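The tape analogy can be sketched in a few lines of R. This is a toy illustration under assumed conventions, not how Git stores data internally: the change format (a line number plus replacement text) and the function names `apply_change` and `replay` are hypothetical.

```r
# Base version of the document, one element per line.
base_doc <- c("Title: Fish", "Draft intro")

# Each change replaces line `line` with `text`
# (or appends when `line` is one past the end).
changes <- list(
  list(line = 2, text = "Revised intro"),
  list(line = 3, text = "Methods section")
)

apply_change <- function(doc, change) {
  doc[change$line] <- change$text
  doc
}

# "Play the tape": apply the first `n` changes on top of the base version.
replay <- function(base, changes, n = length(changes)) {
  Reduce(apply_change, changes[seq_len(n)], base)
}

latest  <- replay(base_doc, changes)     # all changes applied
earlier <- replay(base_doc, changes, 1)  # state after the first change only
print(latest)
```

Stopping the replay early recovers any intermediate version, which is what recalling a specific version amounts to in this picture.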
From Software Carpentry.
Creates an easily navigable map of the history of all changes made
Integrated with RStudio
Everyone struggles with reproducibility and it is a hindrance to moving science forward
Even with a fairly simple analysis, challenges arise in four main areas: organization, documentation, automation, and dissemination
Over the two-day workshop, data analysis tasks will become more complex as we gather more data and ask more complicated questions, so we need better tools and workflows to combat issues arising in these areas
#1 Adopt a reproducible research workflow
#2 Train new researchers who don’t have any other workflow